MATH 70076 - Data Science
Mid-module feedback
menti.com
54 08 92 6
Suppose we have a model \(f\), which takes as inputs predictors \(X\) and model parameters \(\theta\), to produce predictions \(f(X,\theta)\) of outcomes \(y\).
We pick our model parameters \(\hat \theta\) and obtain predictions \(\hat y = f(X, \hat \theta)\) by optimising some loss function \(L(y, \hat y)\), e.g.:
\[\hat\theta = \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n} [y_i - f(x_i, \theta)]^2\]
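For a linear model \(f(X, \theta) = X\theta\), this optimisation has a closed-form solution. A minimal sketch on synthetic data (the predictors, coefficients, and noise level here are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: n observations, 2 predictors, linear ground truth.
n = 200
X = rng.normal(size=(n, 2))
theta_true = np.array([1.5, -2.0])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# For f(X, theta) = X @ theta, the mean squared-error loss is
# minimised by ordinary least squares.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ theta_hat
mse = np.mean((y - y_hat) ** 2)
```

With the low noise level used here, \(\hat\theta\) lands very close to the true coefficients.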
Our model \(f(X, \hat\theta)\) is a mapping between predictor space and response space.
This mapping is not necessarily simple or straightforward to explain. But before the model can be put into production we will most likely have to:
These tasks are ambiguous
With the person next to you, identify at least two interpretations of each.
Strong evidence that the predictor influences the outcome (statistical significance)
The value of the predictor has a large influence on the predicted value (large effect size)
Usually interested in the combination of these (meaningful effect)
Catalytic Predictors
When considering predictor importance, this might depend on which other predictors are in the model and how the model allows these to interact with one another. If first-order or second-order interactions are not included in the model then predictor importance could be masked.
What approaches might we take to quantifying predictor importance?
With the person next to you, discuss how you might assess the importance of a predictor within a particular model.
To investigate the importance of a predictor within a model we want to investigate how much worse the model would be without that predictor.
Destroy the connection between the \(p^{\text{th}}\) predictor and the outcome by shuffling or resampling its values.
(Taking care to propagate this into AR and interaction terms as needed)
Calculate \(L_0 = L(y,\ f(X, \hat\theta_X))\)
Calculate \(L_1 = L(y,\ f(X_{\tilde p}, \hat \theta_{X_{\tilde p }}))\)
If \(|L_1 - L_0|\) is large, then the feature was important to the model.
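The procedure above can be sketched in a few lines of numpy. The data, the least-squares `fit` stand-in for an arbitrary model, and the predictor indices are all hypothetical; note that, matching the notation \(\hat\theta_{X_{\tilde p}}\), the model is refitted after shuffling:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit(X, y):
    """Least-squares fit: a stand-in for fitting an arbitrary model f(X, theta)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# Synthetic data where predictor 0 matters and predictor 1 is pure noise.
n = 500
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

theta_hat = fit(X, y)
L0 = loss(y, X @ theta_hat)

def permuted_loss(X, y, p):
    """Shuffle predictor p to destroy its link with y, refit, and re-score."""
    X_tilde = X.copy()
    X_tilde[:, p] = rng.permutation(X_tilde[:, p])
    theta_tilde = fit(X_tilde, y)
    return loss(y, X_tilde @ theta_tilde)

L1_important = permuted_loss(X, y, p=0)  # loss jumps: predictor 0 was important
L1_noise = permuted_loss(X, y, p=1)      # loss barely changes
```

Shuffling the informative predictor inflates the loss dramatically, while shuffling the noise predictor leaves it essentially unchanged.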
What do we mean by ‘large’?
Discuss with the person next to you how we might quantify whether the change in model performance is large.
Calculate \(L_0 = L(y,\ f(X, \hat\theta_X))\)
For \(i = 1, \ldots, m\): calculate \(L_i = L(y,\ f(X_{\tilde p}^{(i)}, \hat \theta_{X_{\tilde p}^{(i)}}))\)
For pairs of loss function values \(\{i, j : 0 \leq i < j \leq m\}\) calculate \(D_{ij} = L_i - L_j\).
Compare the distribution of \(\{D_{ij} : i = 0\}\) to that of \(\{D_{ij} : i > 0\}\).
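A sketch of this repeated-permutation scheme, again using a least-squares fit as a stand-in for an arbitrary model and synthetic data (the sample size, number of permutations \(m\), and predictor index are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def fit(X, y):
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

n, m, p = 400, 20, 0  # sample size, number of permutations, predictor index
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# L_0: loss of the model fitted to the original data.
losses = [loss(y, X @ fit(X, y))]

# L_1, ..., L_m: losses after independent shuffles of predictor p,
# refitting the model each time.
for _ in range(m):
    X_tilde = X.copy()
    X_tilde[:, p] = rng.permutation(X_tilde[:, p])
    losses.append(loss(y, X_tilde @ fit(X_tilde, y)))

L = np.array(losses)
D = L[:, None] - L[None, :]        # D_ij = L_i - L_j for all pairs
idx = np.triu_indices(m + 1, k=1)  # pairs with i < j
D_with_0 = D[idx][idx[0] == 0]     # pairs involving the unpermuted fit
D_without_0 = D[idx][idx[0] > 0]   # pairs among permuted fits only
```

If the predictor matters, `D_with_0` sits far below zero while `D_without_0` clusters around zero, making "large" a comparison between two empirical distributions rather than an arbitrary threshold.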
This should be a familiar idea
This is a similar concept to an LRT or an NHST of \(\beta_p = 0\) vs \(\beta_p \neq 0\) in a linear model, but non-parametric and applicable to an arbitrary model.
Counterfactual modelling
Adjust the values of one (or more) predictors and see what prediction would have been made. Leads to various methods to quantify and visualise the effect of each predictor on the response.
Benefits:
Care needed:
For each individual \(i\), vary the value of the \(p^{\text{th}}\) predictor and plot how the prediction changes, holding their other predictor values \(x_{i,-p}\) fixed:
\[\hat y_i(x_{ip}) = f(x_{i,-p},\ x_{ip};\ \hat\theta_X)\]
Useful for detecting direct and first-order effects.
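Computing the ICE curves is just repeated prediction over a grid. A sketch with a hypothetical fitted model `f_hat` that contains an interaction between predictors 0 and 1 (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical fitted model with an interaction between predictors 0 and 1.
def f_hat(X):
    return X[:, 0] + 2.0 * X[:, 1] + 1.5 * X[:, 0] * X[:, 1]

n = 50
X = rng.normal(size=(n, 2))
grid = np.linspace(-2, 2, 21)  # values at which to evaluate predictor p
p = 0

# ICE: one curve per individual, varying x_ip over the grid while
# holding that individual's other predictor values fixed.
ice = np.empty((n, grid.size))
for i in range(n):
    X_i = np.tile(X[i], (grid.size, 1))
    X_i[:, p] = grid
    ice[i] = f_hat(X_i)
```

Plotting the rows of `ice` against `grid` gives one curve per individual; here the interaction shows up as non-parallel curves, with each individual's slope depending on their value of predictor 1.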
What would an ICE plot look like for a covariate with no predictive power?
How might an age:vaccine interaction show on this plot?
Point-wise mean of ICE
Shows the average effect of the predictor at the population level.
Not for any individual or the “average” individual.
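The partial dependence curve is the point-wise mean of the ICE curves. A self-contained sketch, reusing the same kind of hypothetical fitted model (names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def f_hat(X):
    # Hypothetical fitted model with an interaction between predictors 0 and 1.
    return X[:, 0] + 2.0 * X[:, 1] + 1.5 * X[:, 0] * X[:, 1]

n = 50
X = rng.normal(size=(n, 2))
grid = np.linspace(-2, 2, 21)
p = 0

# ICE curves: one per individual.
ice = np.empty((n, grid.size))
for i in range(n):
    X_i = np.tile(X[i], (grid.size, 1))
    X_i[:, p] = grid
    ice[i] = f_hat(X_i)

# Partial dependence: point-wise mean of the ICE curves.
pdp = ice.mean(axis=0)
```

Note how the averaging hides the interaction: the PDP slope is the population-average slope, which need not describe any single individual.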
Test your understanding
\[ Y_i \sim \mathrm{N}(X_i\beta + \eta_i, \sigma^2) \quad \text{ where } \quad \eta_i \sim \mathrm{N}(0, \tau^2).\]
\[Y_i \sim \mathrm{Bern}(Z_i) \quad \text{ where } \quad Z_i = \frac{\exp\{X_i\beta + \eta_i\}}{1 + \exp\{X_i\beta + \eta_i\}}\quad \text{ and } \quad \eta_i \sim \mathrm{N}(0, \tau^2).\]
Use an explainable model to construct a local approximation \(g\) to the true response surface \(f\).
e.g. using local linear regression for \(g\) around the point \(x^\prime\):
\[ g(x) = \hat \beta_0 + \hat \beta_1 x\] where
\[ (\hat \beta_0, \hat \beta_1) = \underset{\beta_0, \beta_1}{\arg\min} \sum_{i=1}^{n} w(x_i - x^\prime) [y_i - \beta_0 - \beta_1 x_i]^2.\]
Is a linear model the only / best choice? No.
How do we pick \(w\)? Any kernel function: hat, Gaussian, Epanechnikov…
How do we pick the bandwidth? Tricky and context-dependent. LOOCV on "local-ish" points.
Do we have to use evaluations at other observations to construct \(g\)? No! Augmentation is good but leads us into experimental design territory.
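A minimal sketch of the local linear surrogate with a Gaussian kernel, using a hypothetical one-dimensional black box (`f_hat`, the observation points, and the bandwidth are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)

def f_hat(x):
    """Hypothetical black-box response surface (1-d for simplicity)."""
    return np.sin(x)

# Evaluations of the black box at observed points.
x_obs = rng.uniform(-3, 3, size=200)
y_obs = f_hat(x_obs)

def local_linear(x_prime, x, y, bandwidth=0.5):
    """Fit g(x) = b0 + b1 * x by weighted least squares around x_prime,
    with Gaussian kernel weights w(x_i - x_prime)."""
    w = np.exp(-0.5 * ((x - x_prime) / bandwidth) ** 2)
    A = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)  # weighted LS via rescaled ordinary LS
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta  # (b0_hat, b1_hat)

b0, b1 = local_linear(0.0, x_obs, y_obs)
```

Near \(x^\prime = 0\), \(\sin(x) \approx x\), so the surrogate's intercept comes out near 0 and its slope near 1 (slightly shrunk by the curvature within the kernel window).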
Example explanation of a classification model with 2 predictors.
All previous explanations have been model specific.
Interpretation of many models is dependent on which predictors are included in the model.
SHAP shows feature importance over a family of \(2^p\) models.
Similar to permutation testing with some important differences:
Chapter 9.6 of Interpretable ML gives a more detailed explanation of SHAP calculations
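For small \(p\), the Shapley values behind SHAP can be computed exactly by enumerating all feature subsets. A sketch under simplifying assumptions: a hypothetical linear model, absent features "averaged out" over a background dataset, and all names (`f_hat`, `value`, `shapley`) invented for illustration:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical linear model and a background dataset.
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.0])

def f_hat(X):
    return X @ beta

def value(S, x, X, f):
    """v(S): expected prediction when features in S are fixed at x
    and the rest are averaged over the background data."""
    X_mix = X.copy()
    for j in S:
        X_mix[:, j] = x[j]
    return f(X_mix).mean()

def shapley(x, X, f):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all subsets of the other features."""
    p = X.shape[1]
    phi = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                weight = (math.factorial(len(S))
                          * math.factorial(p - len(S) - 1)
                          / math.factorial(p))
                phi[j] += weight * (value(S + (j,), x, X, f) - value(S, x, X, f))
    return phi

x = np.array([1.0, 1.0, 1.0])
phi = shapley(x, X, f_hat)
```

For a linear model this recovers \(\phi_j = \beta_j (x_j - \bar{x}_j)\), and the efficiency property holds: the \(\phi_j\) sum to the prediction at \(x\) minus the average prediction over the background data. The \(2^p\) subset enumeration is only feasible for small \(p\); practical SHAP implementations approximate it.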
Example application:
There are many explainability methods; which is best depends on:
Preparing for Production - Explain and Scale - Zak Varty